Model Evaluation

Author

김보람

Published

May 11, 2023

ref

선형대수와 통계학으로 배우는 머신러닝 with 파이썬 (Machine Learning with Python, Learned through Linear Algebra and Statistics)
github

Pipeline

from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Note: load_boston was removed in scikit-learn 1.2, so this call needs an older release
raw_boston = datasets.load_boston()

X = raw_boston.data
y = raw_boston.target

# Split into training / test data
X_tn, X_te, y_tn, y_te = train_test_split(X, y, random_state=7)

# Standardize: fit the scaler on the training data only, then apply it to both sets
std_scale = StandardScaler()
X_tn_std = std_scale.fit_transform(X_tn)
X_te_std = std_scale.transform(X_te)

# Train
clf_linear = LinearRegression()
clf_linear.fit(X_tn_std, y_tn)

# Predict
pred_linear = clf_linear.predict(X_te_std)

# Evaluate
mean_squared_error(y_te, pred_linear)

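# Split into training / test data
X_tn, X_te, y_tn, y_te = train_test_split(X, y, random_state=7)

# Pipeline: scaling and regression chained into a single estimator
linear_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_regression', LinearRegression())
])

# Train (the scaler is fitted on the training data inside the pipeline)
linear_pipeline.fit(X_tn, y_tn)

# Predict (the test data is scaled with the training statistics automatically)
pred_linear = linear_pipeline.predict(X_te)

# Evaluate
mean_squared_error(y_te, pred_linear)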
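
To check that the pipeline does the same work as the manual scale-then-fit steps, the two approaches can be run side by side. This is a minimal sketch that assumes the `load_diabetes` dataset, chosen only because it ships with current scikit-learn releases; it is not the dataset used above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_diabetes stands in for the dataset used in the post
X, y = load_diabetes(return_X_y=True)
X_tn, X_te, y_tn, y_te = train_test_split(X, y, random_state=7)

# Manual approach: scale, then fit, then predict
scaler = StandardScaler()
X_tn_std = scaler.fit_transform(X_tn)
X_te_std = scaler.transform(X_te)
manual_model = LinearRegression().fit(X_tn_std, y_tn)
pred_manual = manual_model.predict(X_te_std)

# Pipeline approach: the same two steps wrapped in one estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_regression', LinearRegression()),
])
pipe.fit(X_tn, y_tn)
pred_pipe = pipe.predict(X_te)

# Both routes should produce the same predictions (up to floating-point error)
same_predictions = bool(np.allclose(pred_manual, pred_pipe))
print(same_predictions)
print(mean_squared_error(y_te, pred_manual), mean_squared_error(y_te, pred_pipe))
```

The practical benefit is that the pipeline is a single object, so it can be passed directly to tools such as cross-validation without scaling the data by hand each time.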

Grid Search

A method that trains the model once for each candidate value of the hyperparameters of interest and keeps the value that performs best.

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Load the iris dataset
raw_iris = datasets.load_iris()

# Features / target
X = raw_iris.data
y = raw_iris.target

# Split into training / test data
X_tn, X_te, y_tn, y_te = train_test_split(X, y, random_state=0)

# Standardize: fit the scaler on the training data only
std_scale = StandardScaler()
std_scale.fit(X_tn)
X_tn_std = std_scale.transform(X_tn)
X_te_std = std_scale.transform(X_te)

best_accuracy = 0

# Try k = 1..10 and keep the first k that reaches the highest test accuracy
for k in [1,2,3,4,5,6,7,8,9,10]:
    clf_knn = KNeighborsClassifier(n_neighbors=k)
    clf_knn.fit(X_tn_std, y_tn)
    knn_pred = clf_knn.predict(X_te_std)
    accuracy = accuracy_score(y_te, knn_pred)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        final_k = {'k': k}

print(final_k)
# Note: accuracy holds the score of the last k tried (k=10); the best score is stored in best_accuracy
print(accuracy)

{'k': 3}
0.9736842105263158
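
The loop above picks k by looking at test-set accuracy, which can make the reported score optimistic. As a hedged alternative, the same search can be sketched with scikit-learn's `GridSearchCV`, which selects k by cross-validation on the training data only. The step names (`scaler`, `knn`) and the choice of `cv=5` are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tn, X_te, y_tn, y_te = train_test_split(X, y, random_state=0)

# Scaling lives inside the pipeline, so it is refitted on each CV training fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Parameter names use the "<step name>__<parameter>" convention
param_grid = {'knn__n_neighbors': list(range(1, 11))}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_tn, y_tn)

print(grid.best_params_)   # best k according to cross-validation
print(grid.best_score_)    # mean cross-validated accuracy for that k

# The test set is used once, only for the final evaluation
test_accuracy = grid.score(X_te, y_te)
print(test_accuracy)
```

The k chosen this way may differ from the one found by the loop, because the selection criterion is cross-validated accuracy rather than test accuracy.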

Model Performance Evaluation

Classification

Accuracy

\[\dfrac{1}{n}\sum_{i=1}^n I(\hat y_i = y_i)\]

# Accuracy: fraction of correct predictions (normalize=False returns the count instead)
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
print(accuracy_score(y_true, y_pred))
print(accuracy_score(y_true, y_pred, normalize=False))

0.5
2
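
The accuracy formula can also be evaluated directly, which makes the role of the indicator function explicit. A minimal sketch with NumPy, reusing the same labels as above (the variable names `matches`, `manual_count`, and `manual_accuracy` are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_pred = np.array([0, 2, 1, 3])
y_true = np.array([0, 1, 2, 3])

# I(y_hat_i == y_i): True where the prediction matches the label
matches = (y_pred == y_true)

manual_count = int(matches.sum())        # number of correct predictions
manual_accuracy = float(matches.mean())  # (1/n) * sum of the indicator

print(manual_accuracy, manual_count)
print(accuracy_score(y_true, y_pred))
```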

## confusion matrix (rows: true labels, columns: predicted labels)
from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
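
To see how the matrix is filled in, it can be rebuilt by hand: each (true, predicted) pair adds one to the cell at row `true`, column `predicted`. This is a small sketch under the assumption that the classes are the integers 0 to 2; `manual_cm` and `recall_per_class` are illustrative names.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

n_classes = 3
manual_cm = np.zeros((n_classes, n_classes), dtype=int)

# Row index = true label, column index = predicted label
for t, p in zip(y_true, y_pred):
    manual_cm[t, p] += 1

print(manual_cm)

# Per-class recall: correct predictions (diagonal) divided by the row totals
recall_per_class = manual_cm.diagonal() / manual_cm.sum(axis=1)
print(recall_per_class)
```

The diagonal holds the correctly classified samples, so everything off the diagonal is a misclassification.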

## classification report 
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.67      1.00      0.80         2
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.50      0.67         2

    accuracy                           0.60         5
   macro avg       0.56      0.50      0.49         5
weighted avg       0.67      0.60      0.59         5
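
The per-class rows of the report can be reproduced from counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below is one way to do this by hand; the helper names `safe_div` and `class_metrics` are hypothetical, and treating a zero denominator as 0.0 is an assumption made here for simplicity.

```python
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]

def safe_div(num, den):
    # Assumption: report 0.0 when the denominator is zero
    return num / den if den else 0.0

def class_metrics(y_true, y_pred, cls):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = safe_div(tp, tp + fp)   # of the samples predicted as cls, how many were right
    recall = safe_div(tp, tp + fn)      # of the samples that are cls, how many were found
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1

results = {cls: class_metrics(y_true, y_pred, cls) for cls in (0, 1, 2)}
for cls, (p, r, f) in results.items():
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The macro average in the report is the unweighted mean of these per-class values, while the weighted average uses the support of each class as the weight.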

Regression

Mean Absolute Error

\[MAE=\dfrac{1}{n}\sum_{i=1}^n|y_i-\hat y_i|\]

# mean absolute error
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

print(mean_absolute_error(y_true, y_pred))

0.5

Mean Squared Error

\[MSE=\dfrac{1}{n}\sum_{i=1}^n(y_i-\hat y_i)^2\]

# mean squared error
from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_squared_error(y_true, y_pred))

0.375
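
Both formulas are short enough to evaluate directly, which also shows how they weight errors differently. A minimal NumPy sketch on the same values (RMSE, the square root of MSE, is added here as an extra for comparison and is not covered above):

```python
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

errors = y_true - y_pred                 # [0.5, -0.5, 0.0, -1.0]

mae = float(np.mean(np.abs(errors)))     # mean of absolute errors
mse = float(np.mean(errors ** 2))        # mean of squared errors
rmse = float(np.sqrt(mse))               # back on the scale of y

print(mae, mse, rmse)
```

Because the errors are squared, the single error of 1.0 accounts for two thirds of the squared-error total but only half of the absolute-error total, so MSE reacts more strongly to large errors than MAE does.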

# R2
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(r2_score(y_true, y_pred))

0.9486081370449679
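
The value printed by `r2_score` is the coefficient of determination, which compares the model's squared error with that of always predicting the mean of the true values:

\[R^2=1-\dfrac{\sum_{i=1}^n(y_i-\hat y_i)^2}{\sum_{i=1}^n(y_i-\bar y)^2}\]

A short sketch that evaluates this by hand on the same values (`ss_res`, `ss_tot`, and `manual_r2` are illustrative names):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean

manual_r2 = float(1 - ss_res / ss_tot)

print(manual_r2)
print(r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, 0 means the model is no better than predicting the mean, and negative values are possible when it is worse.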

Clustering

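- Silhouette score

A measure of how well different clusters are separated from one another.
The score is higher when points in the same cluster lie close together and points in different clusters lie far apart.
a: mean distance from a point to the other points in its own cluster
b: mean distance from that point to the points of another cluster, minimized over the other clusters
The silhouette score ranges from -1 to 1, and a higher score indicates better clustering.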

\[s=\dfrac{b-a}{\max(a,b)}\]

# silhouette score
from sklearn.metrics import silhouette_score
X = [[1, 2], [4, 5], [2, 1], [6, 7], [2, 3]]
labels = [0, 1, 0, 1, 0] 
sil_score = silhouette_score(X, labels)
print(sil_score)

0.5789497702625118
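
The definitions of a and b can be applied point by point, with the overall score taken as the mean of s over all points. The following is a minimal sketch assuming Euclidean distance and clusters with at least two members each; `manual_sil` is an illustrative name.

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [4, 5], [2, 1], [6, 7], [2, 3]], dtype=float)
labels = np.array([0, 1, 0, 1, 0])

# Pairwise Euclidean distance matrix
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

scores = []
for i in range(len(X)):
    # a: mean distance to the other points in the same cluster
    same = (labels == labels[i])
    same[i] = False  # exclude the point itself
    a = dist[i, same].mean()

    # b: smallest mean distance to the points of any other cluster
    b = min(
        dist[i, labels == other].mean()
        for other in np.unique(labels)
        if other != labels[i]
    )

    scores.append((b - a) / max(a, b))

manual_sil = float(np.mean(scores))
print(manual_sil)
print(silhouette_score(X, labels))
```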

# adjusted rand index
from sklearn.metrics import adjusted_rand_score
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(adjusted_rand_score(labels_true, labels_pred))

0.24242424242424243
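
The adjusted Rand index compares two groupings rather than the label values themselves, so renaming the clusters does not change the score. A brief sketch of that property (the `renamed` labeling is a made-up example):

```python
from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]

# Same grouping as labels_true, only the cluster names are swapped
renamed = [1, 1, 1, 0, 0, 0]
ari_renamed = adjusted_rand_score(labels_true, renamed)

# The partially matching grouping from the example above
labels_pred = [0, 0, 1, 1, 2, 2]
ari_partial = adjusted_rand_score(labels_true, labels_pred)

# The score is symmetric in its two arguments
ari_swapped = adjusted_rand_score(labels_pred, labels_true)

print(ari_renamed, ari_partial, ari_swapped)
```

A score of 1.0 means the two groupings are identical up to renaming, and values near 0 are what random labelings would be expected to produce.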